Lead Scoring Project

¶

Author: Marcelo Cruz
Feel free to contact me: https://www.linkedin.com/in/marcelo-cruz-segura

Table of Content¶

  • 1. Problem Context

  • 2. Prepare Work Enviroment

  • 3. Load and inspect data

  • 4. Data cleaning & Feature Engineering

  • 5. Explore missing values

  • 6. Exploratory Data Analysis

  • 7. Data Wrangling

  • 8. Modeling

  • 9. Select the best models and tune them

  • 10. Make our predictions

  • 11. Conclusions

1. Problem Context

¶

Lead scoring is a process of assigning scores to prospects based on their profile and behavioral data in order to prioritize leads, improve close rates, and decrease buying cycles.

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not.

The typical lead conversion rate at X education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’.

If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.

There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.

1.1. Business Goal¶

  • Goal from a business perspective:
    X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

  • Goal from a Data Scientist perspective:
    Our mission is to build a better lead scoring model, targeting an 80% conversion rate and precision score. Using predict_proba(), we'll assess lead probabilities. This project aims to gain insights and emphasize a data-driven approach for success.

1.2 Data Dictionary¶

Variable Description
Prospect ID A unique ID with which the customer is identified.
Lead Number A lead number assigned to each lead procured.
Lead Origin The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
Lead Source The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
Do Not Email An indicator variable selected by the customer wherein they select whether or not they want to be emailed about the course or not.
Do Not Call An indicator variable selected by the customer wherein they select whether or not they want to be called about the course or not.
Converted The target variable. Indicates whether a lead has been successfully converted or not.
TotalVisits The total number of visits made by the customer on the website.
Total Time Spent on Website The total time spent by the customer on the website.
Page Views Per Visit Average number of pages on the website viewed during the visits.
Last Activity Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
Country The country of the customer.
Specialization The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.
How did you hear about X Education The source from which the customer heard about X Education.
What is your current occupation Indicates whether the customer is a student, unemployed or employed.
What matters most to you in choosing this course An option selected by the customer indicating what is their main motto behind doing this course.
Search Indicating whether the customer had seen the ad in any of the listed items.
Magazine
Newspaper Article
X Education Forums
Newspaper
Digital Advertisement
Through Recommendations Indicates whether the customer came in through recommendations.
Receive More Updates About Our Courses Indicates whether the customer chose to receive more updates about the courses.
Tags Tags assigned to customers indicating the current status of the lead.
Lead Quality Indicates the quality of lead based on the data and intuition of the employee who has been assigned to the lead.
Update me on Supply Chain Content Indicates whether the customer wants updates on the Supply Chain Content.
Get updates on DM Content Indicates whether the customer wants updates on the DM Content.
Lead Profile A lead level assigned to each customer based on their profile.
City The city of the customer.
Asymmetrique Activity Index An index and score assigned to each customer based on their activity and their profile.
Asymmetrique Profile Index
Asymmetrique Activity Score
Asymmetrique Profile Score
I agree to pay the amount through cheque Indicates whether the customer has agreed to pay the amount through cheque or not.
a free copy of Mastering The Interview Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
Last Notable Activity The last notable activity performed by the student.

2. Prepare Work Enviroment

¶

2.1 Import Libraries¶

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
from scipy.stats import linregress, uniform
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score, precision_score, precision_recall_curve, PrecisionRecallDisplay, confusion_matrix

2.2 Suppress Warnings & display options¶

In [29]:
warnings.filterwarnings('ignore')
In [30]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

3. Load and inspect data

¶

In [31]:
df = pd.read_csv('https://raw.githubusercontent.com/CeloCruz/LeadScoring/main/Lead%20Scoring.csv')
df.head()
Out[31]:
Prospect ID Lead Number Lead Origin Lead Source Do Not Email Do Not Call Converted TotalVisits Total Time Spent on Website Page Views Per Visit Last Activity Country Specialization How did you hear about X Education What is your current occupation What matters most to you in choosing a course Search Magazine Newspaper Article X Education Forums Newspaper Digital Advertisement Through Recommendations Receive More Updates About Our Courses Tags Lead Quality Update me on Supply Chain Content Get updates on DM Content Lead Profile City Asymmetrique Activity Index Asymmetrique Profile Index Asymmetrique Activity Score Asymmetrique Profile Score I agree to pay the amount through cheque A free copy of Mastering The Interview Last Notable Activity
0 7927b2df-8bba-4d29-b9a2-b6e0beafe620 660737 API Olark Chat No No 0 0.0 0 0.0 Page Visited on Website NaN Select Select Unemployed Better Career Prospects No No No No No No No No Interested in other courses Low in Relevance No No Select Select 02.Medium 02.Medium 15.0 15.0 No No Modified
1 2a272436-5132-4136-86fa-dcc88c88f482 660728 API Organic Search No No 0 5.0 674 2.5 Email Opened India Select Select Unemployed Better Career Prospects No No No No No No No No Ringing NaN No No Select Select 02.Medium 02.Medium 15.0 15.0 No No Email Opened
2 8cc8c611-a219-4f35-ad23-fdfd2656bd8a 660727 Landing Page Submission Direct Traffic No No 1 2.0 1532 2.0 Email Opened India Business Administration Select Student Better Career Prospects No No No No No No No No Will revert after reading the email Might be No No Potential Lead Mumbai 02.Medium 01.High 14.0 20.0 No Yes Email Opened
3 0cc2df48-7cf4-4e39-9de9-19797f9b38cc 660719 Landing Page Submission Direct Traffic No No 0 1.0 305 1.0 Unreachable India Media and Advertising Word Of Mouth Unemployed Better Career Prospects No No No No No No No No Ringing Not Sure No No Select Mumbai 02.Medium 01.High 13.0 17.0 No No Modified
4 3256f628-e534-4826-9d63-4a8b88782852 660681 Landing Page Submission Google No No 1 2.0 1428 1.0 Converted to Lead India Select Other Unemployed Better Career Prospects No No No No No No No No Will revert after reading the email Might be No No Select Mumbai 02.Medium 01.High 15.0 18.0 No No Modified

Shape and info about the dataset

In [32]:
df.shape
Out[32]:
(9240, 37)
In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 non-null   float64
 10  Last Activity                                  9137 non-null   object 
 11  Country                                        6779 non-null   object 
 12  Specialization                                 7802 non-null   object 
 13  How did you hear about X Education             7033 non-null   object 
 14  What is your current occupation                6550 non-null   object 
 15  What matters most to you in choosing a course  6531 non-null   object 
 16  Search                                         9240 non-null   object 
 17  Magazine                                       9240 non-null   object 
 18  Newspaper Article                              9240 non-null   object 
 19  X Education Forums                             9240 non-null   object 
 20  Newspaper                                      9240 non-null   object 
 21  Digital Advertisement                          9240 non-null   object 
 22  Through Recommendations                        9240 non-null   object 
 23  Receive More Updates About Our Courses         9240 non-null   object 
 24  Tags                                           5887 non-null   object 
 25  Lead Quality                                   4473 non-null   object 
 26  Update me on Supply Chain Content              9240 non-null   object 
 27  Get updates on DM Content                      9240 non-null   object 
 28  Lead Profile                                   6531 non-null   object 
 29  City                                           7820 non-null   object 
 30  Asymmetrique Activity Index                    5022 non-null   object 
 31  Asymmetrique Profile Index                     5022 non-null   object 
 32  Asymmetrique Activity Score                    5022 non-null   float64
 33  Asymmetrique Profile Score                     5022 non-null   float64
 34  I agree to pay the amount through cheque       9240 non-null   object 
 35  A free copy of Mastering The Interview         9240 non-null   object 
 36  Last Notable Activity                          9240 non-null   object 
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB

Initial thoughts and action plan: 1.

Initial thoughts and action plan:
  • Check duplicates
  • Drop Prospect ID and Lead Number (No additional information).
  • Reformat columns names without space and lowercase for practicity, and change some columns name for others names more intuitive.
  • Seems that we got a lot of binary columns, with only yes or no values etc. Convert it in binary encoding
  • Use just numbers instead of the mix of strings and integers in columns: Asymmetrique Activity/Profile Index
  • Do some text cleaning, some columns seems to have differents formats to present the same info.
  • Decide what to do with all the "Select" data, if count it as null or assign an other category like "Not Answered".
  • Handling missing values.

Check if the columns specified are really binary

In [34]:
binary_cats = ['Do Not Email','Do Not Call','Search','Magazine','Newspaper Article',
               'X Education Forums','Newspaper','Digital Advertisement','Through Recommendations',
               'Receive More Updates About Our Courses', 'Update me on Supply Chain Content','Get updates on DM Content',
               'I agree to pay the amount through cheque', 'A free copy of Mastering The Interview']

null_values = df[binary_cats].isnull().sum()
total = df[binary_cats].count()
yes_no = df[binary_cats].applymap(lambda x: 1 if x == 'Yes' or x == 'No' else 0).sum()
df_binary_cats = pd.DataFrame({'total': total,
                               'null_%': null_values/total*100,
                               'yes/no_%': yes_no/total*100})
df_binary_cats
Out[34]:
total null_% yes/no_%
Do Not Email 9240 0.0 100.0
Do Not Call 9240 0.0 100.0
Search 9240 0.0 100.0
Magazine 9240 0.0 100.0
Newspaper Article 9240 0.0 100.0
X Education Forums 9240 0.0 100.0
Newspaper 9240 0.0 100.0
Digital Advertisement 9240 0.0 100.0
Through Recommendations 9240 0.0 100.0
Receive More Updates About Our Courses 9240 0.0 100.0
Update me on Supply Chain Content 9240 0.0 100.0
Get updates on DM Content 9240 0.0 100.0
I agree to pay the amount through cheque 9240 0.0 100.0
A free copy of Mastering The Interview 9240 0.0 100.0
  • There's no missing values.
  • All the columns have only "Yes" or "No" values.

3.1. Separate train and test datasets¶

Let's separate train and test set before keep seeing more info.
Separating train and test data is essential to avoid data leakage, evaluate model generalization, and make unbiased performance assessments in machine learning. It ensures robust model development and reliable predictions on new, unseen data.

Why stratify by target label?
Stratifying train and test datasets in classification ensures balanced class representation, guarding against biased or imbalanced model learning. It promotes accurate evaluation, preventing skewed performance metrics.

In [35]:
train, test = train_test_split(df, test_size=.2, random_state=12, stratify=df['Converted'])
print(f'train shape: {train.shape}')
print(f'test shape: {test.shape}')
train shape: (7392, 37)
test shape: (1848, 37)

3.2. Inspecting only training dataset¶

In [36]:
print(f'In the train set are {train.duplicated().sum()} duplicates')
In the train set are 0 duplicates

Check the values in Asymmetrique Index columns

In [37]:
train['Asymmetrique Profile Index'].value_counts(dropna=False)
Out[37]:
NaN          3362
02.Medium    2243
01.High      1762
03.Low         25
Name: Asymmetrique Profile Index, dtype: int64
In [38]:
train['Asymmetrique Activity Index'].value_counts(dropna=False)
Out[38]:
NaN          3362
02.Medium    3080
01.High       648
03.Low        302
Name: Asymmetrique Activity Index, dtype: int64
Assymetrique's columns treatment:
    We have identified three distinct categories and some missing records. To improve the data's representativeness for machine learning modeling, we will focus on the integer values and reverse their order, emphasizing a higher-is-better perspective.

  • High: Assigned a value of 3
  • Medium: Assigned a value of 2
  • Low: Assigned a value of 1

4. Data cleaning & Feature Engineering

4.1 Data Cleaning¶

Let us embark on our first data cleaning endeavor! Our strategy involves transforming each step into Scikit-learn transformation objects, harmonizing the entire process into a unified pipeline.

Why is it a commendable practice to conduct all preprocessing tasks using Scikit-learn?
By encapsulating each step into transformation objects, we nurture modularity and reusability. This seamless integration in pipelines ensures consistent application to both training and test datasets, simplifying model selection and tuning while optimizing efficiency and scalability. Ultimately, this fosters a standardized and maintainable machine learning workflow.

In [39]:
def data_cleaning(df):
  """Do some of the data cleaning procedures that we
  specified at the begining of the notebook"""
  # drop columns id columns
  df = df.drop(['Prospect ID','Lead Number'], axis=1)

  # asymmetrique index columns transformation
  df['Asymmetrique Activity Index'] = df['Asymmetrique Activity Index'].str.split('.', expand=True)[0]\
                                                                        .str.replace('0','').str.replace('1','4')\
                                                                        .str.replace('3','1').str.replace('4','3')\
                                                                        .astype(np.float64
                                                                               )
  df['Asymmetrique Profile Index'] = df['Asymmetrique Profile Index'].str.split('.', expand=True)[0]\
                                                                        .str.replace('0','').str.replace('1','4')\
                                                                        .str.replace('3','1').str.replace('4','3')\
                                                                        .astype(np.float64
                                                                               )
  # binary encoding
  df[binary_cats] = df[binary_cats].applymap(lambda x: 0 if x == 'No' else 1)
    
  # rename columns for practicity
  df.columns = df.columns.str.replace(' ','_').str.lower()
  return df

# Convert custom function into transformer
initial_clean = FunctionTransformer(data_cleaning)

train_clean = initial_clean.fit_transform(train);

4.2 Inspecting category columns¶

In this stage, we'll first inspect the categorical columns from a practical and business-oriented perspective, before delving into more advanced statistical analysis.

I firmly believe that simplicity often holds the key to effective solutions.

The goal is to take a first look through all category columns to do some feature engineering, extract some initial thoughts for future EDA/feature engineer and handling missing values.

For the sake of the notebook's shortness, I omitted outputs. Feel free to download the code and check it yourself!

In [40]:
train_clean.lead_origin.value_counts(dropna=False);
In [41]:
train_clean.lead_source.value_counts(dropna=False);
Insights:
  • It's probable that NaN values originated from an "Other" source.
  • We should rename "google" to "Google."
  • Probably "Pay Per Click Ads" originated from "Google."
  • We can group "Referral Sites" with "Blog" and "WeLearnBlog_Home."
  • We can group "Live Chat" into the same category as "Olark Chat."
  • Group "bing" with "Organic Search."
  • Group "Click2call," "Social Media," "testone," "Press_release," "youtubechannel," "NC_EDM," and "WeLearn" into the "Other" category.
In [42]:
train_clean.last_activity.value_counts(dropna=False);
Insights:
  • Group "Email Received" with "SMS Sent."
  • Group "Email Marked Spam" with "Email Bounced" and "Unsubscribed" in a new category called "Not interested in email."
  • Group "Resubscribed to emails" with "Email Opened."
  • Group "Visited Booth in Tradeshow" and "View in browser link Clicked" into "Page Visited on Website," as they express the same level of interest, although not the same activity.
In [43]:
train_clean.country.value_counts(dropna=False);
Insights:
  • "Unknown" and "Asia/Pacific Region" could be classified as "NaN" as they don't add additional information.
  • Possibly group regions into "Europe" and "Rest of Asia/Oceania."
In [44]:
train_clean.specialization.value_counts(dropna=False);
Insights:
  • Group "E-COMMERCE" and "E-Business" into a single category called "e-commerce."
  • Group "Banking, Investment And Insurance" with "Finance Management."
  • Group "Media and Advertising" with "Marketing Management."
In [45]:
train_clean.how_did_you_hear_about_x_education.value_counts(dropna=False);
Insight:
  • Group "SMS" and "Email" into a category called "SMS/Email."
In [46]:
train_clean.what_is_your_current_occupation.value_counts(dropna=False);
In [47]:
train_clean.what_matters_most_to_you_in_choosing_a_course.value_counts(dropna=False);
Insight:
  • Considering the low occurrence of cases outside "Better Career Prospects" and "NaN," using the mode for replacement is prudent. However, thorough analysis and validation are crucial to avoid potential biases.
In [48]:
train_clean.tags.value_counts(dropna=False);
Insights:
  • Group "Invalid Number," "Wrong Number Given," "Number Not Provided," and "Oops Hangup" into a category called "Not Interested in Calls."
  • Group "Interested in Full-Time MBA," "In Confusion Whether Part-Time or DLP," and "Interested in Next Batch" into a category called "Shows Certain Interest."
  • Group "Lost to EINS," "In Touch with EINS," "Want to Take Admission but Has Financial Problems," "Recognition Issue (DEC Approval)," and "Graduation in Progress" as a new category called "Not Eligible for the Moment."
  • Group "University Not Recognized" and "Diploma Holder (Not Eligible)" into a category called "Not Eligible."
  • Group "Interested in Other Courses" and "Not Doing Further Education" as "Doesn't Show Interest."
  • Group "Busy," "Ringing," and "Switched Off" into a new category called "Still No Contact."
  • Group "Closed by Horizzon" in the same category as "Already a Student."
  • Group "Shall Take in the Next Month" and "Still Thinking" into a category called "Not Sure."
  • Consider grouping "Lateral Student," "Lost to Others," and the rest of the minor categories as "Others."
In [49]:
train_clean.lead_quality.value_counts(dropna=False);
Insights:
  • Possibly use `OrdinalEncoder`.
  • Missing values might not be assigned yet to a category; in that case, we can group those values in the same category as "Not Sure."
In [50]:
train_clean.lead_profile.value_counts(dropna=False);
In [51]:
train_clean.city.value_counts(dropna=False);
In [52]:
train_clean.last_notable_activity.value_counts(dropna=False);
Insight:
  • Apply the same procedure as used for the "Last Activity" column.
Data Manipulation:
Upon review, we found many "select" entries in columns, indicating potential blanks or unassigned values. We'll consider "select" as "Not Provided" instead of typical missing data in our analysis to better understand this pattern.
Possible Changes to Evaluate in Further Analysis:
  • Reduce "Assymetrique" columns, which might contain redundant information.
  • Consider the necessity of having both "last_activity" and "last_notable_activity" columns.
  • Possibly use `OrdinalEncoder` in the Lead Quality column.
  • Consider that missing values in the Lead Quality column might not be assigned yet to a category; in that case, they can be grouped with the same category as "Not Sure."
  • Investigate the possibility that NaN values in the Lead Source column originated from an "Other" source.
  • Possibly group regions in the country column into "Europe" and "Rest of Asia/Oceania."
  • Consider dropping the "tags" column.

4.3 Initial feature engineering¶

Apply initial changes described in the previous insights.

In [54]:
def initial_feature_engineering(df):
  """Do some feature engineering"""
  # lead_source
  df['lead_source'] = df['lead_source'].str.replace('|'.join(['google','Pay per Click Ads']),'Google')
  df['lead_source'] = df['lead_source'].apply(lambda x: "Referral Sites" if 'blog' in str(x) else x)
  df['lead_source'] = df['lead_source'].str.replace('Live Chat','Olark Chat')
  df['lead_source'] = df['lead_source'].str.replace('bing','Organic Search')
  df['lead_source'] = df[df['lead_source'] != 'Other'].lead_source.apply(lambda x: "Other" if str(x) not in train_clean.lead_source.value_counts()[:8].index else x)
  # last_activity and last_notable_activity
  activity = ['last_activity','last_notable_activity']
  df[activity] = df[activity].apply(lambda x: x.str.replace('|'.join(['Email Received','SMS Sent']),'SMS/Email Sent'))
  df[activity] = df[activity].apply(lambda x: x.str.replace('|'.join(['Email Marked Spam','Email Bounced','Unsubscribed']),'Not interested in email'))
  df[activity] = df[activity].apply(lambda x: x.str.replace('Resubscribed to emails','Email Opened'))
  df[activity] = df[activity].apply(lambda x: x.str.replace('|'.join(['Visited Booth in Tradeshow','View in browser link Clicked']),'Page Visited on Website'))
  # country
  df['country'] = df['country'].apply(lambda x: np.nan if x in ['Unknown','unknown','Asia/Pacific Region'] else x)
  # specialization
  df['specialization'] = df['specialization'].str.replace('|'.join(['E-COMMERCE','E-Business']),'E-commerce')
  df['specialization'] = df['specialization'].str.replace('Banking, Investment And Insurance','Finance Management')
  df['specialization'] = df['specialization'].str.replace('Media and Advertising','Marketing Management')
  df['specialization'] = df['specialization'].str.replace('Select','Not Provided')
  # how_did_you_hear
  df['how_did_you_hear_about_x_education'] = df['how_did_you_hear_about_x_education'].str.replace('Select','Not Provided')
  df['how_did_you_hear_about_x_education'] = df['how_did_you_hear_about_x_education'].str.replace('|'.join(['SMS','Email']),'SMS/Email')
  # importance_in_course
  df['what_matters_most_to_you_in_choosing_a_course'] = df['what_matters_most_to_you_in_choosing_a_course'].str.replace('|'.join(['Flexibility & Convenience','Other']),"Better Career Prospects")
  # lead_profile
  df['lead_profile'] = df['lead_profile'].str.replace('Select','Not Assigned')
  # city
  df['city'] = df['city'].str.replace('Select','Not Provided')

  return df

initial_feature_engineering = FunctionTransformer(initial_feature_engineering)
train_clean = initial_feature_engineering.fit_transform(train_clean);

5. Explore missing values

¶

Copy of the dataset and visualizations style

In [55]:
train_ = train_clean.copy()

# Set style for better visualizations
train_eda = train.copy()
sns.set_style('dark')
sns.set(rc={'axes.grid':False})
sns.set_palette('viridis')
In [56]:
null_ = pd.DataFrame()
null_['proportion'] = np.round(train_clean.isnull().sum()/len(train_clean),4) * 100
null_['amount'] = train_clean.isnull().sum()

# Show only those columns with at least 1 missing value
null_.sort_values(by='proportion', ascending=False)[null_.amount > 0]
Out[56]:
proportion amount
lead_quality 51.35 3796
asymmetrique_activity_index 45.48 3362
asymmetrique_profile_score 45.48 3362
asymmetrique_profile_index 45.48 3362
asymmetrique_activity_score 45.48 3362
tags 36.35 2687
lead_profile 29.40 2173
what_matters_most_to_you_in_choosing_a_course 29.40 2173
what_is_your_current_occupation 29.21 2159
country 26.50 1959
how_did_you_hear_about_x_education 23.92 1768
specialization 15.61 1154
city 15.41 1139
page_views_per_visit 1.45 107
totalvisits 1.45 107
last_activity 1.08 80
Insights:
  • Missing values in certain columns, often requiring employee input, might stem from uncategorized leads. Streamlining lead management can improve data collection, inform decision-making, and optimize lead conversion strategies. Further investigation is necessary to confirm this hypothesis.

Define some plot functions

In [57]:
def barplot_catcols(column,width,heigh):
  """Plot conversion rate"""
  fig, ax  = plt.subplots(figsize=(width,heigh))
  ax = sns.barplot(data=train_.fillna('NaN'), x='converted', y=column,
            order=order(train_.fillna('NaN'),column),
            orient='h', palette='viridis',
            seed=2)
  plt.title(f'Conversion Rate by {column.replace("_"," ").title()}', loc='left', size=18)
  return ax

def order(df,x,y=None):
    if y is not None:
        return df.groupby(x)[y].mean().sort_values(ascending=False).index
    else:
        return df.groupby(x)['converted'].mean().sort_values(ascending=False).index

5.1 How much of the missing values belong to the same people?¶

In [58]:
# Number of missing values in each row
train_['amount_missing'] = train_.isnull().sum(1)

# Plot the relation between amount missing and conversion rate
fig, ax  = plt.subplots(figsize=(8,5))
ax = sns.barplot(data=train_.fillna('NaN'), x='converted', y='amount_missing',
            orient='h', palette='viridis',
            seed=2)
plt.title(f'Conversion Rate by Amount Missing', loc='left', size=20)
plt.show()
In [59]:
fig, ax  = plt.subplots(figsize=(8,2))
ax = sns.barplot(data=train_, x='amount_missing', y='converted',
            orient='h', palette=sns.color_palette('viridis',2),
            seed=2)
plt.title(f'Amount missing by leads conversion', loc='left', size=18)
plt.show()

5.2 Correlation of numerical columns with converted column¶

In [60]:
correlations = train_.corr()['converted'].sort_values(ascending=False)

plt.figure(figsize=(8, 8))
correlations[1:].plot(kind='barh', 
                 color=sns.color_palette('viridis', len(correlations)))

plt.title('Correlation with the target variable', fontsize=20)
plt.xlabel('Correlation')
plt.ylabel('Features')
plt.show()
Insight:
  • There's a negative correlation between missing lead data and the conversion rate. Higher instances of missing data might signify incomplete or poorly managed lead information, leading to potential difficulties in accurately categorizing and nurturing leads.
  • Notably, there's one column, "Total Time Spent On Website," which has a stronger correlation with the target variable than the number of missing values columns. This suggests that it might be a better potential predictor than most of the other features.
In [61]:
print(f'Duplicate rows from original dataset: {train.duplicated().sum()}')
print(f'Duplicate rows after feature engineer: {train_clean.duplicated().sum()}')
Duplicate rows from original dataset: 0
Duplicate rows after feature engineer: 984
Handle Missing Values:
We currently lack sufficient information to determine the best approach for dealing with missing values. To address this, we will conduct a detailed data exploration, searching for patterns related to lead conversion. Once we have a clearer understanding, we can devise the most appropriate strategy for handling these missing records.

6. Exploratory Data Analysis

¶

Considering the prevalence of categorical or binary variables, we'll treat "NaN" values as a distinct category for comparison. For numerical columns with few "NaN" values, we'll exclude them to ensure robust analysis. This follows EDA best practices for gaining valuable insights from the dataset.

In [62]:
count = train_['converted'].value_counts()

fig, ax = plt.subplots(figsize=(10, 5))
ax.pie(count, labels=count.index, autopct='%1.1f%%', startangle=90, colors=['#29568CFF', '#3CBB75FF'])
ax.set_title('Converted', size=20)

centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)

plt.axis('equal')
plt.show()
Insight:
The dataset exhibits a relatively balanced distribution of the target variable. While there may be some variations in class proportions, it's not extremely unbalanced.

Do we get redundant information among different columns?¶

In [63]:
train_.loc[:,'asymmetrique_activity_index':'asymmetrique_profile_score'].corr().style.background_gradient(cmap='vlag_r')
Out[63]:
  asymmetrique_activity_index asymmetrique_profile_index asymmetrique_activity_score asymmetrique_profile_score
asymmetrique_activity_index 1.000000 -0.145399 0.855985 -0.122669
asymmetrique_profile_index -0.145399 1.000000 -0.145366 0.883177
asymmetrique_activity_score 0.855985 -0.145366 1.000000 -0.114636
asymmetrique_profile_score -0.122669 0.883177 -0.114636 1.000000
Insight:
  • As expected, there's a strong correlation between the "Score" and "Index" columns. Given the level of detail in the data, retaining the score columns appears to be a sound choice. These columns appear to offer valuable information, and their inclusion in our analysis is likely to yield valuable insights.

5.1 Categorical variables¶

5.1.1 Profile scoring and classifier¶

In [64]:
fig, ax  = plt.subplots(1,2, figsize=(12,6), sharey=True)

sns.barplot(data=train_.fillna('NaN'), x='lead_profile', y='converted',
            palette='viridis', order=order(train_.fillna('NaN'),'lead_profile'),
            seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Lead Profile', loc='left', size=16)

sns.barplot(data=train_.fillna('NaN'), x='asymmetrique_profile_score', y='converted',
                  palette='viridis', order=order(train_.fillna('NaN'),'asymmetrique_profile_score'),
                    seed=2, ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by Asymmetrique Profile Score', loc='left', size=16)

plt.tight_layout()
plt.show()
Insights:
  • There's a significant difference in the conversion rate of people with "Not Assigned" and "NaN" values, which might suggest that they do not belong to the same group.
  • Profile Score could be a better predictor than Lead Profile, as the conversion rate tends to increase with higher scores.
  • Because both columns essentially represent the same information, it is advisable to drop the Lead Profile column for simplicity and clarity in the analysis.

5.1.2 Lead Activity¶

Correlation between activity track record (columns related with the web) and activity/profile score

In [65]:
activity_columns = ['totalvisits','total_time_spent_on_website','page_views_per_visit',
                    'asymmetrique_profile_score','asymmetrique_activity_score']

train_[activity_columns].corr().style.background_gradient(cmap='vlag_r')
Out[65]:
  totalvisits total_time_spent_on_website page_views_per_visit asymmetrique_profile_score asymmetrique_activity_score
totalvisits 1.000000 0.261952 0.598883 0.129016 -0.061397
total_time_spent_on_website 0.261952 1.000000 0.323684 0.167992 -0.066008
page_views_per_visit 0.598883 0.323684 1.000000 0.165945 -0.171264
asymmetrique_profile_score 0.129016 0.167992 0.165945 1.000000 -0.114636
asymmetrique_activity_score -0.061397 -0.066008 -0.171264 -0.114636 1.000000

Having columns about last activity and last notable activity provides more information?

In [66]:
fig, ax  = plt.subplots(1,2, figsize=(12,6), sharey=True)

sns.barplot(data=train_.fillna('NaN'), x='last_activity', y='converted',
            order=order(train_.fillna('NaN'),'last_activity'),
            palette='viridis',
            seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Last Activity', loc='left', size=16)

sns.barplot(data=train_.fillna('NaN'), x='last_notable_activity', y='converted',
                  order=order(train_.fillna('NaN'),'last_notable_activity'),
                  palette='viridis', seed=2)
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by Last Notable Activity', loc='left', size=16)

plt.tight_layout()
plt.show()
Insights:
  • Activity Score seems to be a less effective predictor compared to Profile Score.
  • No significant relationship was found between Activity Index and Last Activity, Last Notable Activity, or columns related to visits.
  • There's no significant correlation among columns related to visits.
  • Last Activity does not appear to provide substantially more information than Last Notable Activity. Hence, it may be preferable to retain Last Notable Activity.
Business Suggestion:
    Our analysis reveals a significant correlation between phone conversations and lead conversions. To maximize results, consider increasing phone calls to leads. Prioritizing "Hot Leads" for calls can enhance resource allocation and boost conversion rates, ultimately driving better business outcomes.

5.1.3 Lead Quality and Tags¶

In [67]:
barplot_catcols('lead_quality',8,3)
plt.show()
Insights:
    NaN rows and Not Sure have similar conversion rate, it's possible that belong to the same group of people with the difference that some employees qualify as Not Sure, and others simply don't fill that field.
In [68]:
fig, ax  = plt.subplots(figsize=(13,4))

sns.barplot(data=train_.fillna('NaN'), x='tags', y='converted',
            order=order(train_.fillna('NaN'),'tags'),
            palette='viridis',
            seed=2)
plt.xticks(rotation=90)
plt.title(f'Conversion Rate by Tags', loc='left', size=20)
plt.show()
Feature Engineer:
  • Group "Invalid Number," "Wrong Number Given," and "Number Not Provided" into a category called "Not Interested in Calls."
  • Group "In Confusion Whether Part-Time or DLP," "Interested in Next Batch," "Shall Take in the Next Month," and "Still Thinking" into a category called "Shows Certain Interest."
  • Group "University Not Recognized" and "Diploma Holder (Not Eligible)" into a new category called "Not Eligible."
  • Group "Interested in Other Courses," "Interested in Full-Time MBA," and "Not Doing Further Education" as "Doesn't Show Interest."
  • Group "Ringing" and "Switched Off" in a new category called "Still No Contact."
  • Group "Want to Take Admission But Has Financial Problems," "Recognition Issue (DEC Approval)," "Graduation in Progress" as a new category called "Not Eligible for the Moment."
  • "Lateral Student," "Lost to Others," and the rest of the minor categories might be grouped as "Others."

5.1.4 Ocupation and Specialization¶

In [69]:
fig, ax  = plt.subplots(1,2, figsize=(14,7), sharey=True)

sns.barplot(data=train_.fillna('NaN'), x='specialization', y='converted',
            order=order(train_.fillna('NaN'),'specialization'),
            palette='viridis',
            seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Specialization', loc='left', size=16)

sns.barplot(data=train_.fillna('NaN'), x='what_is_your_current_occupation', y='converted',
                  order=order(train_.fillna('NaN'),'what_is_your_current_occupation'),
                  palette='viridis', seed=2)
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by Occupation', loc='left', size=16)

plt.tight_layout()
plt.show()

Number of missing values for each row in these two categories

In [70]:
train_[['what_is_your_current_occupation','specialization']].isnull().sum(1).value_counts()
Out[70]:
0    5220
2    1141
1    1031
dtype: int64
Insights:
  • Almost all (>99%) of the people with missing records in the specialization column are the same people with missing values in the current position.
  • People with missing values in those categories have a significant difference in the conversion rate compared to the rest, including unemployed people.
  • Grouping people by current occupation seems to exhibit more noticeable differences in the conversion rate between each category compared to grouping by specialization.

5.1.5 Geographic data¶

In [71]:
conversion_country = train_.groupby('country')['converted'].mean()
country_count = train_['country'].value_counts().sort_index()

fig = go.Figure(data=go.Choropleth(
    locations=conversion_country.index,
    locationmode='country names',
    z=conversion_country.values,  
    text=country_count.values, 
    colorscale='deep',  
    colorbar_title='Conversion Rate',
    hovertemplate='%{location}<br>Conversion: %{z:.2f}<br>Count: %{text}',
))

fig.update_geos(projection_type="mercator")

fig.update_layout(
    title='Conversion Rate by Country',
    geo=dict(showcoastlines=True),
    font=dict(size=16),
)

fig.show()
In [72]:
train_['country'].value_counts().sort_index()
Out[72]:
Australia                  9
Bahrain                    6
Bangladesh                 2
Belgium                    2
Canada                     3
China                      2
Denmark                    1
France                     4
Germany                    4
Ghana                      2
Hong Kong                  3
India                   5201
Italy                      2
Kenya                      1
Kuwait                     3
Malaysia                   1
Netherlands                2
Nigeria                    4
Oman                       4
Philippines                1
Qatar                      9
Russia                     1
Saudi Arabia              17
Singapore                 21
South Africa               3
Sri Lanka                  1
Sweden                     2
Switzerland                1
Tanzania                   1
Uganda                     1
United Arab Emirates      49
United Kingdom            13
United States             57
Name: country, dtype: int64
In [73]:
barplot_catcols('city',8,4)
plt.show()

Is the geographic data correct?

In [74]:
print("Cities where country isn't India:")
train_[train_['country'] != 'India'].city.value_counts(dropna=False)
Cities where country isn't India:
Out[74]:
Not Provided                   992
NaN                            693
Mumbai                         244
Other Cities                    98
Thane & Outskirts               83
Other Cities of Maharashtra     49
Other Metro Cities              27
Tier II Cities                   5
Name: city, dtype: int64
In [75]:
print('Countries where City es equal to an Indian city:')
indian_cities = ['Mumbai','Thane & Outskirts','Other Cities of Maharashtra','Tier II Cities']
train_[train_.city.isin(indian_cities)].country.value_counts(dropna=False)
Countries where City es equal to an Indian city:
Out[75]:
India                   3220
NaN                      270
United States             32
United Arab Emirates      19
Singapore                 11
United Kingdom             9
Saudi Arabia               8
Australia                  6
Qatar                      5
Bahrain                    4
Germany                    3
Belgium                    2
Canada                     2
Netherlands                2
Kuwait                     1
France                     1
Sweden                     1
Malaysia                   1
Hong Kong                  1
Switzerland                1
Oman                       1
China                      1
Name: country, dtype: int64
Data Manipulation:
  1. We have incorrect information in the dataset. There are countries with cities that belong to India, so we can assume that due to the vast majority of customers from India in the dataset these countries should be replace it. Those rows that contains "Mumbai", "Thane & Outskirts", "Other Cities of Maharashtra" and "Tier II Cities" will be fill it with "India".
  2. Group all the countries that aren't in the top 5 most frequent in a category called "Other".
  3. People without information in the city column have a significantly lower conversion rate compared to others.

5.1.6 The source from which the customer heard about X Education and the source of the lead¶

In [76]:
fig, ax  = plt.subplots(1,2, figsize=(14,7), sharey=True)

sns.barplot(data=train_.fillna('NaN'), x='lead_source', y='converted',
            order=order(train_.fillna('NaN'),'lead_source'),
            palette='viridis',
            seed=2, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
ax[0].set_title(f'Conversion Rate by Lead Source', loc='left', size=16)

sns.barplot(data=train_.fillna('NaN'), x='how_did_you_hear_about_x_education', y='converted',
                  order=order(train_.fillna('NaN'),'how_did_you_hear_about_x_education'),
                  palette='viridis', seed=2)
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90)
ax[1].set_title(f'Conversion Rate by How Did You Hear About It', loc='left', size=16)

plt.tight_layout()
plt.show()
Insights:
    Both columns present similar information. In that case, Lead Source seems to be a better potential predictor because of the lack of missing values. We'll keep only Lead Source.
Business Suggestion:
    Referrals, with a 90% conversion rate, are a top-performing lead source due to their trustworthiness. To capitalize on this potential, the business should incentivize, personalize, track, showcase testimonials, and leverage word-of-mouth marketing for effective growth.

5.2 Numeric variables¶

In [77]:
train_.select_dtypes(include=['number']).nunique().sort_values()
Out[77]:
i_agree_to_pay_the_amount_through_cheque       1
get_updates_on_dm_content                      1
update_me_on_supply_chain_content              1
receive_more_updates_about_our_courses         1
magazine                                       1
do_not_email                                   2
through_recommendations                        2
a_free_copy_of_mastering_the_interview         2
newspaper                                      2
digital_advertisement                          2
newspaper_article                              2
search                                         2
converted                                      2
do_not_call                                    2
x_education_forums                             2
asymmetrique_activity_index                    3
asymmetrique_profile_index                     3
asymmetrique_profile_score                    10
asymmetrique_activity_score                   11
amount_missing                                14
totalvisits                                   40
page_views_per_visit                         103
total_time_spent_on_website                 1635
dtype: int64
Data Manipulation:
    Almost all the numerical columns in the dataset represent binary outputs. But here's the catch: the vast majority of these columns have only one value, which is 0. That means they don't really offer any meaningful information to the model, and they're just cluttering up the dataset. Thus, it would be wise to drop these columns to improve model performance and reduce unnecessary noise in the data.

5.2.1 Columns related to web visits¶

In [78]:
fig, ax  = plt.subplots(3, figsize=(8,6))
sns.barplot(data=train_, x='totalvisits', y='converted',
            orient='h', palette='viridis',
            seed=2, ax=ax[0])
ax[0].set_title(f'Avg. Number of visits', loc='left', size=18)

sns.barplot(data=train_, x='total_time_spent_on_website', y='converted',
            orient='h', palette='viridis',
            seed=2, ax=ax[1])
ax[1].set_title(f'Avg. Time spent on website', loc='left', size=18)

sns.barplot(data=train_, x='page_views_per_visit', y='converted',
            orient='h', palette='viridis',
            seed=2, ax=ax[2])
ax[2].set_title(f'Avg. Page views per visit', loc='left', size=18)

plt.tight_layout()
plt.show()
In [79]:
fig, ax = plt.subplots(3,1, figsize=(8,6))
sns.boxplot(data=train_, x='totalvisits',
              ax=ax[0], palette='viridis')
ax[0].set_title('Total Visits', loc='left', size=16)

sns.boxplot(data=train_, x='total_time_spent_on_website',
              ax=ax[1], palette='viridis')
ax[1].set_title('Time spent on web', loc='left', size=16)

sns.boxplot(data=train_, x='page_views_per_visit',
              ax=ax[2], palette='viridis')
ax[2].set_title('Page views per visit', loc='left', size=16)

plt.tight_layout()
plt.show()
Insights:
  1. There's a significant difference in conversion rate among both groups.
  2. Leads that convert more spent much more time on the website.
Business Suggestion:
    To capitalize on this insight, we should enhance website engagement, optimize CTAs, and tailor content and offers based on lead preferences. Improving website navigation and implementing personalized lead nurturing campaigns can also boost conversion rates. Continuously monitoring performance and conducting A/B tests will allow for iterative improvements and better lead conversion outcomes.

7. Data Wrangling

¶

Outliers Treatment:

Addressing outliers in TotalVisits and Page Views Per Visit is essential for model performance, particularly in Logistic Regression. Capping these variables at the 95th percentile is recommended for model stability and preventing inflated coefficients. It enhances model generalization in various classification models like Decision Trees, Random Forests, and Support Vector Machines.

Missing Values Strategy:

  • Numeric Columns (KNN Imputation): Utilizing KNNImputer for imputing missing values in Total Visits and Page Views Per Visit is a preferable choice over median, mean, or mode imputation. KNNImputer considers feature relationships, preserving data distribution, and handling multicollinearity effectively.

  • Categorical Columns (Missing Category): Treating missing values as a separate category, rather than imputing with the mode, maintains data integrity, avoids biases, and improves model reliability and accuracy, especially considering the significant difference in conversion rate between leads with missing records and others.

7.1 Feature Engineer¶

Let's apply all the insights discovered during EDA.

In [80]:
def eda_feature_engineering(df):
  # tags column
  df['tags'] = df['tags'].str.replace('|'.join(['invalid number','wrong number given','number not provided']),'Not interest in calls')
  df['tags'] = df['tags'].str.replace('|'.join(["In confusion whether part time or DLP", "Interested in Next batch", "Shall take in the next coming month", "Still Thinking"]), "Shows certain interest")
  df['tags'] = df['tags'].str.replace("University not recognized","Not elegible")
  df['tags'] = df[df['tags'].notnull()].tags.apply(lambda x: 'Not elegible' if 'holder' in x else x)
  df['tags'] = df['tags'].str.replace('|'.join(["Interested in other courses", "Interested  in full time MBA", "Not doing further education"]),"Doesn't show interest")
  df['tags'] = df['tags'].str.replace('|'.join(["Ringing","switched off"]),"Still no contact")
  df['tags'] = df['tags'].str.replace('|'.join(["Want to take admission but has financial problems", "Graduation in progress"]),"Not elegible for the moment")
  df['tags'] = df[df['tags'].notnull()].tags.apply(lambda x: 'Not elegible for the moment' if 'Recognition' in x else x)
  df['tags'] = df[df['tags'].notnull()].tags.apply(lambda x: 'Other' if x not in df.tags.value_counts(dropna=False)[:12] else x)

  # country and city
  indian_cities = ['Mumbai','Thane & Outskirts','Other Cities of Maharashtra','Tier II Cities']
  df.loc[(df.country != 'India') & (df.city.isin(indian_cities)),'country'] = 'India'
  df['country'] = df.loc[df['country'].notnull(),'country'].apply(lambda x: 'Other' if x not in df.loc[df['country'] != 'Other','country'].value_counts()[:4] else x)

  # lead quality
  df['lead_quality'] = df['lead_quality'].fillna('Not Sure')

  # convert asymmetrique index columns in strings columns
  df[['asymmetrique_profile_index','asymmetrique_activity_index']] = df[['asymmetrique_profile_index','asymmetrique_activity_index']].astype(str)

  # drop columns with unique values
  drop_cols = ['magazine','receive_more_updates_about_our_courses','update_me_on_supply_chain_content',
               'get_updates_on_dm_content','i_agree_to_pay_the_amount_through_cheque']
  df = df.drop(drop_cols, axis=1)

  #add amount_missing column
  df['amount_missing'] = df.isnull().sum(1)
  return df

eda_feature_engineering = FunctionTransformer(eda_feature_engineering)

7.2 Handling Outliers¶

In [81]:
def cap_outliers(df):
  """Replace outliers with the 95th percentile"""
  num_cols = ['totalvisits','page_views_per_visit','total_time_spent_on_website']
  df[num_cols[0]].apply(lambda x: df[num_cols[0]].quantile(.95) if x > df[num_cols[0]].quantile(.95) else x)
  df[num_cols[1]].apply(lambda x: df[num_cols[1]].quantile(.95) if x > df[num_cols[1]].quantile(.95) else x)
  df[num_cols[2]].apply(lambda x: df[num_cols[2]].quantile(.95) if x > df[num_cols[2]].quantile(.95) else x)
  return df

cap_outliers = FunctionTransformer(cap_outliers);

7.3 Handling missing values and scaling columns for modeling¶

  1. Apply OneHotEncoder to all the categorical columns.
  2. Apply StandardScaler to the numeric columns if there aren't binary.
Note:
In the remainder='drop' process, we'll be removing the following columns, as identified during the EDA:
  • 'asymmetrique_profile_index'
  • 'asymmetrique_activity_index'
  • 'lead_profile'
  • 'last_activity'
  • 'specialization'
  • 'how_did_you_hear_about_x_education'
In [82]:
cat_columns = ['lead_origin','lead_source','country','what_is_your_current_occupation',
                'what_matters_most_to_you_in_choosing_a_course','tags','lead_quality',
                'city','last_notable_activity']

num_cols = ['totalvisits','page_views_per_visit','total_time_spent_on_website',
            'asymmetrique_activity_score','asymmetrique_profile_score','amount_missing']

impute_knn = KNNImputer(n_neighbors=5)
impute_cons = SimpleImputer(strategy='constant', fill_value='Missing')
ohe = OneHotEncoder(handle_unknown='ignore')
sc = StandardScaler()

# Make pipelines for both type of columns treatments
pipe_cat = make_pipeline(impute_cons,ohe)
pipe_num = make_pipeline(sc,impute_knn)

impute_scale = make_column_transformer(
                                        (pipe_cat, cat_columns),
                                        (pipe_num,num_cols),
                                        remainder='drop'
                                            )

7.5 Separate X and Y¶

In [83]:
X_train = train.drop('Converted',axis=1)
y_train = train.loc[:,'Converted']

7.6 Create an entire pipeline for all preprocessing steps!¶

Creating a comprehensive preprocessing pipeline for ML is essential for consistency, efficiency, and reproducibility. It prevents data leakage, simplifies scaling, and integrates hyperparameter tuning seamlessly. Such a pipeline also aids in model deployment, enhancing performance, and maintaining a reliable ML workflow.

In [84]:
pipe = make_pipeline(
                    initial_clean,
                    initial_feature_engineering,
                    eda_feature_engineering,
                    cap_outliers,
                    impute_scale
              )
# Let's see how it looks
pipe
Out[84]:
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function data_cleaning at 0x000001FE396D4AE0>)),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function initial_feature_engineering at 0x000001FE381AB240>)),
                ('functiontransformer-3',
                 FunctionTransformer(func=<function eda_feature_engineering at 0x000001FE3BFD8540>)),
                ('functiontransformer-...
                                                   'what_is_your_current_occupation',
                                                   'what_matters_most_to_you_in_choosing_a_course',
                                                   'tags', 'lead_quality',
                                                   'city',
                                                   'last_notable_activity']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler()),
                                                                  ('knnimputer',
                                                                   KNNImputer())]),
                                                  ['totalvisits',
                                                   'page_views_per_visit',
                                                   'total_time_spent_on_website',
                                                   'asymmetrique_activity_score',
                                                   'asymmetrique_profile_score',
                                                   'amount_missing'])]))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function data_cleaning at 0x000001FE396D4AE0>)),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function initial_feature_engineering at 0x000001FE381AB240>)),
                ('functiontransformer-3',
                 FunctionTransformer(func=<function eda_feature_engineering at 0x000001FE3BFD8540>)),
                ('functiontransformer-...
                                                   'what_is_your_current_occupation',
                                                   'what_matters_most_to_you_in_choosing_a_course',
                                                   'tags', 'lead_quality',
                                                   'city',
                                                   'last_notable_activity']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler()),
                                                                  ('knnimputer',
                                                                   KNNImputer())]),
                                                  ['totalvisits',
                                                   'page_views_per_visit',
                                                   'total_time_spent_on_website',
                                                   'asymmetrique_activity_score',
                                                   'asymmetrique_profile_score',
                                                   'amount_missing'])]))])
FunctionTransformer(func=<function data_cleaning at 0x000001FE396D4AE0>)
FunctionTransformer(func=<function initial_feature_engineering at 0x000001FE381AB240>)
FunctionTransformer(func=<function eda_feature_engineering at 0x000001FE3BFD8540>)
FunctionTransformer(func=<function cap_outliers at 0x000001FE38373CE0>)
ColumnTransformer(transformers=[('pipeline-1',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='Missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['lead_origin', 'lead_source', 'country',
                                  'what_is_your_current_occupation',
                                  'what_matters_most_to_you_in_choosing_a_course',
                                  'tags', 'lead_quality', 'city',
                                  'last_notable_activity']),
                                ('pipeline-2',
                                 Pipeline(steps=[('standardscaler',
                                                  StandardScaler()),
                                                 ('knnimputer', KNNImputer())]),
                                 ['totalvisits', 'page_views_per_visit',
                                  'total_time_spent_on_website',
                                  'asymmetrique_activity_score',
                                  'asymmetrique_profile_score',
                                  'amount_missing'])])
['lead_origin', 'lead_source', 'country', 'what_is_your_current_occupation', 'what_matters_most_to_you_in_choosing_a_course', 'tags', 'lead_quality', 'city', 'last_notable_activity']
SimpleImputer(fill_value='Missing', strategy='constant')
OneHotEncoder(handle_unknown='ignore')
['totalvisits', 'page_views_per_visit', 'total_time_spent_on_website', 'asymmetrique_activity_score', 'asymmetrique_profile_score', 'amount_missing']
StandardScaler()
KNNImputer()
Note:
Just a quick reminder, we've been working with a copy of the train dataset, so the original one is still in its raw form. Let's proceed to run the pipeline on the train dataset to preprocess and transform the data, preparing it for our machine learning algorithms.
In [85]:
X_train_pp = pipe.fit_transform(X_train)

8. Modeling

¶

We'll start by exploring models for potential strong performance. First, we'll evaluate them using cross-validation with stratified folds to maintain class proportions. The goal is to identify promising models before fine-tuning hyperparameters.


Let's Remember Our Initial Target:

"The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80."

So, with this in mind, we can select our most important performance measure.
In this case, we want to ensure that a high percentage of predicted leads convert to a customer, which means we're looking for a high precision score.

Does that mean we won't care about potential leads not detected (Low recall)?

Not at all. If we tune our models to optimize only for precision, we might be very accurate in their positive predictions but miss many actual positive cases. This translates into leaving money on the table—potential customers that won't convert.

Display function and SratifiedKFold

In [86]:
# Use stratified fold for ensure that we shuffle the dataset and conserve classes
skfold = StratifiedKFold(5, shuffle=True, random_state=12)

def display_scores(model,scores,pred):
  print(f'----------- {model} -----------')
  print('')
  print("------------------ Cross validation scores:")
  print("Scores:", scores)
  print("Mean:", scores.mean())
  print("Standard deviation:", scores.std())
  print('')
  print("--------------- Scores in the training set:")
  print("Precision:", precision_score(y_train,pred))
  print("Recall:", recall_score(y_train,pred))
  print("F1 score:", f1_score(y_train,pred))
  print("ROC - AUC score:", roc_auc_score(y_train,pred))

8.1 Logistic Regression¶

In [87]:
lr = LogisticRegression()
lr_scores = cross_val_score(lr, X_train_pp, y_train,
                            cv=skfold, scoring='f1')
lr.fit(X_train_pp,y_train)
lr_pred = lr.predict(X_train_pp)

# Precision and recall curve
lr_prec, lr_recall, lr_threshold = precision_recall_curve(y_train, lr_pred, pos_label=lr.classes_[1])
lr_prdisplay = PrecisionRecallDisplay(precision=lr_prec, recall=lr_recall)

# Display Scores
display_scores('Logistic Regression',lr_scores,lr_pred)
----------- Logistic Regression -----------

------------------ Cross validation scores:
Scores: [0.92017937 0.91607143 0.91785714 0.92252894 0.93191866]
Mean: 0.9217111080041696
Standard deviation: 0.005547398183110531

--------------- Scores in the training set:
Precision: 0.9390642002176278
Recall: 0.9087399087399087
F1 score: 0.9236532286835533
ROC - AUC score: 0.9358799697782749

8.2 Support Vector Machine¶

In [88]:
svc = SVC()
svc_scores = cross_val_score(svc, X_train_pp, y_train,
                             cv=skfold, scoring='f1')
svc.fit(X_train_pp, y_train)
svc_pred = svc.predict(X_train_pp)

# Precision and recall curve
svc_prec, svc_recall, svc_threshold = precision_recall_curve(y_train, svc_pred, pos_label=svc.classes_[1])
svc_prdisplay = PrecisionRecallDisplay(precision=svc_prec, recall=svc_recall)

# Display scores
display_scores('Support Vector Machine',svc_scores,svc_pred)
----------- Support Vector Machine -----------

------------------ Cross validation scores:
Scores: [0.92032229 0.92375887 0.93167702 0.9204947  0.93684211]
Mean: 0.9266189961289493
Standard deviation: 0.0065640139064668925

--------------- Scores in the training set:
Precision: 0.9438684304612084
Recall: 0.9266409266409267
F1 score: 0.9351753453772582
ROC - AUC score: 0.9460411324818105

8.3 Decission Trees¶

In [89]:
tree = DecisionTreeClassifier(random_state = 7)
tree_scores = cross_val_score(tree, X_train_pp, y_train,
                              cv=skfold, scoring='f1')
tree.fit(X_train_pp, y_train)
tree_pred = tree.predict(X_train_pp)

# Precision and recall curve
tree_prec, tree_recall, tree_threshold = precision_recall_curve(y_train, tree_pred, pos_label=tree.classes_[1])
tree_prdisplay = PrecisionRecallDisplay(precision=tree_prec, recall=tree_recall)

# Display scores
display_scores('Decission Tree',tree_scores,tree_pred)
----------- Decission Tree -----------

------------------ Cross validation scores:
Scores: [0.89533861 0.89821429 0.89938758 0.88736028 0.90394511]
Mean: 0.8968491718576317
Standard deviation: 0.005495095089980634

--------------- Scores in the training set:
Precision: 0.9912434325744308
Recall: 0.9933309933309933
F1 score: 0.9922861150070126
ROC - AUC score: 0.9939140108631633

8.4 Random Forest¶

In [90]:
rf = RandomForestClassifier(random_state=10,
                            oob_score=True)
rf_scores = cross_val_score(rf, X_train_pp, y_train,
                            cv=skfold, scoring='f1')
rf.fit(X_train_pp, y_train)
rf_pred = rf.predict(X_train_pp)
rf_pred_proba = rf.predict_proba(X_train_pp)

# Precision and recall curve
rf_prec, rf_recall, rf_threshold = precision_recall_curve(y_train, rf_pred_proba[:,1], pos_label=rf.classes_[1])
rf_prdisplay = PrecisionRecallDisplay(precision=rf_prec, recall=rf_recall)

# Display scores
display_scores('Random Forest',rf_scores,rf_pred)
print('Oob score: ',rf.oob_score_)
----------- Random Forest -----------

------------------ Cross validation scores:
Scores: [0.9204647  0.92086331 0.93167702 0.92537313 0.9434629 ]
Mean: 0.9283682120932953
Standard deviation: 0.008562210912321892

--------------- Scores in the training set:
Precision: 0.9908995449772489
Recall: 0.9936819936819937
F1 score: 0.9922888187872415
ROC - AUC score: 0.9939794516065702
Oob score:  0.9460227272727273

8.5 Gradient Boosting¶

In [91]:
xg = GradientBoostingClassifier(random_state=11)
xg_scores = cross_val_score(xg, X_train_pp, y_train,
                            cv=skfold, scoring='f1')
xg.fit(X_train_pp, y_train)
xg_pred = xg.predict(X_train_pp)

# Precision and recall curve
xg_prec, xg_recall, xg_threshold = precision_recall_curve(y_train, xg_pred, pos_label=xg.classes_[1])
xg_prdisplay = PrecisionRecallDisplay(precision=xg_prec, recall=xg_recall)

# Display scores
display_scores('Gradient Boosting',xg_scores,xg_pred)
----------- Gradient Boosting -----------

------------------ Cross validation scores:
Scores: [0.92072072 0.92558984 0.92844365 0.92665474 0.94044444]
Mean: 0.9283706783615786
Standard deviation: 0.006557141557698687

--------------- Scores in the training set:
Precision: 0.9537953795379538
Recall: 0.9129519129519129
F1 score: 0.9329268292682927
ROC - AUC score: 0.9426084680321968
First conclusions:

Most models performed well, except for Decision Tree, which overfit. Cross-validation results showed a strong average F1 score of approximately 0.92, indicating robust generalization even with simpler models like Logistic Regression.

Achieving Our Initial Goal:
We've reached our initial goal of exceeding an 80% precision threshold. To further improve the F1 Score, we'll focus on the top-performing models and fine-tune their hyperparameters.

9. Select the best models and tune them

¶

9.1 Recall - Precision Curve for each model¶

In [92]:
fig, ax = plt.subplots(figsize=(8,5))
lr_prdisplay.plot(ax=ax, label='Logistic Regression', color='blue', linewidth=2)
svc_prdisplay.plot(ax=ax, label='Support Vector Classifier', color='green', linewidth=2)
tree_prdisplay.plot(ax=ax, label='Decision Tree', color='red', linewidth=2, alpha=.9)
rf_prdisplay.plot(ax=ax, label='Random Forest', color='purple', linewidth=2, alpha=.7)
xg_prdisplay.plot(ax=ax, label='Gradient Boosting', color='orange', linewidth=2, alpha=.5)
plt.title('Precision Recall Curve (training data)', size=16, loc='left')
plt.show()

9.2 Logistic Regression¶

In [93]:
lr_params = [
              {'C': uniform(loc=0, scale=4),
              'penalty': ['l1','l2'],
              'solver': ['liblinear','saga']}
              ]

lr_randomcv = RandomizedSearchCV(lr, lr_params, cv=skfold,
                                 scoring='f1',
                                 return_train_score = True,
                                 random_state = 10,
                                 n_iter=100)

lr_randomcv.fit(X_train_pp, y_train)

print("---------------- Logistic Regression ---------------")
print("Best Parameters: ", lr_randomcv.best_params_)
print("Best Score: ", lr_randomcv.best_score_)
---------------- Logistic Regression ---------------
Best Parameters:  {'C': 1.4933630402058768, 'penalty': 'l1', 'solver': 'liblinear'}
Best Score:  0.9225907032920444

9.3 Random Forest¶

In [94]:
rf_params = [{
              'n_estimators': np.arange(50,500,50),
              'criterion': ['gini','entropy','logloss'],
              'max_depth': np.arange(2,14,2),
              'max_features': ['sqrt','log2',None, 0.5],
              }]

rf_randomcv = RandomizedSearchCV(rf, rf_params, cv=skfold,
                                 scoring='f1',
                                 return_train_score = True,
                                 random_state = 10,
                                 n_iter=100)

rf_randomcv.fit(X_train_pp, y_train)

print("----------------- Random Forest ----------------")
print("Best Parameters: ", rf_randomcv.best_params_)
print("Best Score: ", rf_randomcv.best_score_)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[94], line 14
      1 rf_params = [{
      2               'n_estimators': np.arange(50,500,50),
      3               'criterion': ['gini','entropy','logloss'],
      4               'max_depth': np.arange(2,14,2),
      5               'max_features': ['sqrt','log2',None, 0.5],
      6               }]
      8 rf_randomcv = RandomizedSearchCV(rf, rf_params, cv=skfold,
      9                                  scoring='f1',
     10                                  return_train_score = True,
     11                                  random_state = 10,
     12                                  n_iter=100)
---> 14 rf_randomcv.fit(X_train_pp, y_train)
     16 print("----------------- Random Forest ----------------")
     17 print("Best Parameters: ", rf_randomcv.best_params_)

File ~\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:874, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    868     results = self._format_results(
    869         all_candidate_params, n_splits, all_out, all_more_results
    870     )
    872     return results
--> 874 self._run_search(evaluate_candidates)
    876 # multimetric is determined here because in the case of a callable
    877 # self.scoring the return type is only known after calling
    878 first_test_score = all_out[0]["test_scores"]

File ~\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:1768, in RandomizedSearchCV._run_search(self, evaluate_candidates)
   1766 def _run_search(self, evaluate_candidates):
   1767     """Search n_iter candidates from param_distributions"""
-> 1768     evaluate_candidates(
   1769         ParameterSampler(
   1770             self.param_distributions, self.n_iter, random_state=self.random_state
   1771         )
   1772     )

File ~\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:821, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    813 if self.verbose > 0:
    814     print(
    815         "Fitting {0} folds for each of {1} candidates,"
    816         " totalling {2} fits".format(
    817             n_splits, n_candidates, n_candidates * n_splits
    818         )
    819     )
--> 821 out = parallel(
    822     delayed(_fit_and_score)(
    823         clone(base_estimator),
    824         X,
    825         y,
    826         train=train,
    827         test=test,
    828         parameters=parameters,
    829         split_progress=(split_idx, n_splits),
    830         candidate_progress=(cand_idx, n_candidates),
    831         **fit_and_score_kwargs,
    832     )
    833     for (cand_idx, parameters), (split_idx, (train, test)) in product(
    834         enumerate(candidate_params), enumerate(cv.split(X, y, groups))
    835     )
    836 )
    838 if len(out) < 1:
    839     raise ValueError(
    840         "No fits were performed. "
    841         "Was the CV iterator empty? "
    842         "Were there no candidates?"
    843     )

File ~\anaconda3\Lib\site-packages\sklearn\utils\parallel.py:63, in Parallel.__call__(self, iterable)
     58 config = get_config()
     59 iterable_with_config = (
     60     (_with_config(delayed_func, config), args, kwargs)
     61     for delayed_func, args, kwargs in iterable
     62 )
---> 63 return super().__call__(iterable_with_config)

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:1088, in Parallel.__call__(self, iterable)
   1085 if self.dispatch_one_batch(iterator):
   1086     self._iterating = self._original_iterator is not None
-> 1088 while self.dispatch_one_batch(iterator):
   1089     pass
   1091 if pre_dispatch == "all" or n_jobs == 1:
   1092     # The iterable was consumed all at once by the above for loop.
   1093     # No need to wait for async callbacks to trigger to
   1094     # consumption.

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:901, in Parallel.dispatch_one_batch(self, iterator)
    899     return False
    900 else:
--> 901     self._dispatch(tasks)
    902     return True

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:819, in Parallel._dispatch(self, batch)
    817 with self._lock:
    818     job_idx = len(self._jobs)
--> 819     job = self._backend.apply_async(batch, callback=cb)
    820     # A job can complete so quickly than its callback is
    821     # called before we get here, causing self._jobs to
    822     # grow. To ensure correct results ordering, .insert is
    823     # used (rather than .append) in the following line
    824     self._jobs.insert(job_idx, job)

File ~\anaconda3\Lib\site-packages\joblib\_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
    206 def apply_async(self, func, callback=None):
    207     """Schedule a func to be run"""
--> 208     result = ImmediateResult(func)
    209     if callback:
    210         callback(result)

File ~\anaconda3\Lib\site-packages\joblib\_parallel_backends.py:597, in ImmediateResult.__init__(self, batch)
    594 def __init__(self, batch):
    595     # Don't delay the application, to avoid keeping the input
    596     # arguments in memory
--> 597     self.results = batch()

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:288, in BatchedCalls.__call__(self)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:288, in <listcomp>(.0)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~\anaconda3\Lib\site-packages\sklearn\utils\parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
    121     config = {}
    122 with config_context(**config):
--> 123     return self.function(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py:686, in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
    684         estimator.fit(X_train, **fit_params)
    685     else:
--> 686         estimator.fit(X_train, y_train, **fit_params)
    688 except Exception:
    689     # Note fit time as time until error
    690     fit_time = time.time() - start_time

File ~\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py:473, in BaseForest.fit(self, X, y, sample_weight)
    462 trees = [
    463     self._make_estimator(append=False, random_state=random_state)
    464     for i in range(n_more_estimators)
    465 ]
    467 # Parallel loop: we prefer the threading backend as the Cython code
    468 # for fitting the trees is internally releasing the Python GIL
    469 # making threading more efficient than multiprocessing in
    470 # that case. However, for joblib 0.12+ we respect any
    471 # parallel_backend contexts set at a higher level,
    472 # since correctness does not rely on using threads.
--> 473 trees = Parallel(
    474     n_jobs=self.n_jobs,
    475     verbose=self.verbose,
    476     prefer="threads",
    477 )(
    478     delayed(_parallel_build_trees)(
    479         t,
    480         self.bootstrap,
    481         X,
    482         y,
    483         sample_weight,
    484         i,
    485         len(trees),
    486         verbose=self.verbose,
    487         class_weight=self.class_weight,
    488         n_samples_bootstrap=n_samples_bootstrap,
    489     )
    490     for i, t in enumerate(trees)
    491 )
    493 # Collect newly grown trees
    494 self.estimators_.extend(trees)

File ~\anaconda3\Lib\site-packages\sklearn\utils\parallel.py:63, in Parallel.__call__(self, iterable)
     58 config = get_config()
     59 iterable_with_config = (
     60     (_with_config(delayed_func, config), args, kwargs)
     61     for delayed_func, args, kwargs in iterable
     62 )
---> 63 return super().__call__(iterable_with_config)

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:1088, in Parallel.__call__(self, iterable)
   1085 if self.dispatch_one_batch(iterator):
   1086     self._iterating = self._original_iterator is not None
-> 1088 while self.dispatch_one_batch(iterator):
   1089     pass
   1091 if pre_dispatch == "all" or n_jobs == 1:
   1092     # The iterable was consumed all at once by the above for loop.
   1093     # No need to wait for async callbacks to trigger to
   1094     # consumption.

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:901, in Parallel.dispatch_one_batch(self, iterator)
    899     return False
    900 else:
--> 901     self._dispatch(tasks)
    902     return True

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:819, in Parallel._dispatch(self, batch)
    817 with self._lock:
    818     job_idx = len(self._jobs)
--> 819     job = self._backend.apply_async(batch, callback=cb)
    820     # A job can complete so quickly than its callback is
    821     # called before we get here, causing self._jobs to
    822     # grow. To ensure correct results ordering, .insert is
    823     # used (rather than .append) in the following line
    824     self._jobs.insert(job_idx, job)

File ~\anaconda3\Lib\site-packages\joblib\_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
    206 def apply_async(self, func, callback=None):
    207     """Schedule a func to be run"""
--> 208     result = ImmediateResult(func)
    209     if callback:
    210         callback(result)

File ~\anaconda3\Lib\site-packages\joblib\_parallel_backends.py:597, in ImmediateResult.__init__(self, batch)
    594 def __init__(self, batch):
    595     # Don't delay the application, to avoid keeping the input
    596     # arguments in memory
--> 597     self.results = batch()

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:288, in BatchedCalls.__call__(self)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~\anaconda3\Lib\site-packages\joblib\parallel.py:288, in <listcomp>(.0)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~\anaconda3\Lib\site-packages\sklearn\utils\parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
    121     config = {}
    122 with config_context(**config):
--> 123     return self.function(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py:184, in _parallel_build_trees(tree, bootstrap, X, y, sample_weight, tree_idx, n_trees, verbose, class_weight, n_samples_bootstrap)
    181     elif class_weight == "balanced_subsample":
    182         curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)
--> 184     tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
    185 else:
    186     tree.fit(X, y, sample_weight=sample_weight, check_input=False)

File ~\anaconda3\Lib\site-packages\sklearn\tree\_classes.py:889, in DecisionTreeClassifier.fit(self, X, y, sample_weight, check_input)
    859 def fit(self, X, y, sample_weight=None, check_input=True):
    860     """Build a decision tree classifier from the training set (X, y).
    861 
    862     Parameters
   (...)
    886         Fitted estimator.
    887     """
--> 889     super().fit(
    890         X,
    891         y,
    892         sample_weight=sample_weight,
    893         check_input=check_input,
    894     )
    895     return self

File ~\anaconda3\Lib\site-packages\sklearn\tree\_classes.py:379, in BaseDecisionTree.fit(self, X, y, sample_weight, check_input)
    368 else:
    369     builder = BestFirstTreeBuilder(
    370         splitter,
    371         min_samples_split,
   (...)
    376         self.min_impurity_decrease,
    377     )
--> 379 builder.build(self.tree_, X, y, sample_weight)
    381 if self.n_outputs_ == 1 and is_classifier(self):
    382     self.n_classes_ = self.n_classes_[0]

KeyboardInterrupt: 

9.4 Gradient Boosting¶

In [ ]:
xg_params = [{
              'n_estimators': np.arange(50,500,50),
              'loss': ['exponential','log_loss'],
              'max_depth': np.arange(2,14,2),
              'criterion': ['friedman_mse', 'squared_error'],
              'learning_rate': uniform(loc=0,scale=.5),
              'max_features': ['sqrt', 'log2', None, 0.5]
              }]

xg_randomcv = RandomizedSearchCV(xg, xg_params, cv=skfold,
                                 scoring='f1',
                                 return_train_score = True,
                                 random_state = 10,
                                 n_iter=50)

xg_randomcv.fit(X_train_pp, y_train)

print("--------------- Gradient Boosting --------------")
print("Best Parameters: ", xg_randomcv.best_params_)
print("Best Score: ", xg_randomcv.best_score_)
Performance Overview:
The performance of all three models remains consistently impressive. Notably, the Random Forest model achieved the highest score, albeit with a slight improvement. Every increase counts, and in this case, a marginal gain of approximately 0.50% in the F1 score underscores its suitability for prediction on the test dataset.

10. Make our predictions

¶

At this point, we've already:

  1. Completed the entire data preprocessing and exploration.
  2. We exclusively used the training dataset to eliminate any potential human bias.
  3. Additionally, we've incorporated all the preprocessing steps into a pipeline to prevent any data leakage.
  4. Next, we selected the most promising models (without tuning) and applied cross-validation to assess their performance.
  5. Following that, we fine-tuned those models using RandomizedSearchCV and identified the best one.

By following these steps, we ensure that the data will be treated as if it were completely new. Now, we're all set to apply the entire pipeline and predict the lead scores using the test dataset!

10.1 Apply al the preprocessing pipeline to the test dataset¶

In [ ]:
X_test = test.drop('Converted',axis=1)
y_test = test.loc[:,'Converted']

# Let's take a look of the first row
X_test.to_numpy()[:1]
In [ ]:
# apply all the preprocessing steps to the test dataset
X_test_pp = pipe.transform(X_test)
X_test_pp.toarray()[:1]

10.2 Random Forest with hyperparameter tuned¶

In [ ]:
rf_rcv_pred = rf_randomcv.predict(X_test_pp)
print("Precision:", precision_score(y_test,rf_rcv_pred))
print("Recall:", recall_score(y_test,rf_rcv_pred))
print("F1 score:", f1_score(y_test,rf_rcv_pred))
print("ROC - AUC score:", roc_auc_score(y_test,rf_rcv_pred))

10.3 Random Forest without hyperparameter tuned¶

In [ ]:
rf_pred_test = rf.predict(X_test_pp)
print("Precision:", precision_score(y_test,rf_pred_test))
print("Recall:", recall_score(y_test,rf_pred_test))
print("F1 score:", f1_score(y_test,rf_pred_test))
print("ROC - AUC score:", roc_auc_score(y_test,rf_pred_test))

Upon closer examination, a marginal improvement of 0.0006 in F1 score appears evident in the untuned model. However, as previously emphasized, our focus primarily rested on elevating precision rather than recall. Given the subtle disparity in F1 scores, the preference leans towards the tuned model. This decision stems from a higher increase of roughly 0.0079 in precision, aligning well with our objectives and priorities.

10.4 Let's plot the confussion matrix for both models!¶

In [ ]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Random Forest tunned
cm1 = confusion_matrix(y_test, rf_rcv_pred)
sns.heatmap(cm1, annot=True, fmt = 'd', cmap='Greens', ax = ax[0], cbar=False)
ax[0].xaxis.set_ticklabels(['Not converted', 'Converted'])
ax[0].yaxis.set_ticklabels(['Not converted', 'Converted'])
ax[0].set_title('RF with hyperparameters tuning', loc='left')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('True')

# Random Forest without tuning
cm2 = confusion_matrix(y_test, rf_pred_test)
sns.heatmap(cm2, annot=True, fmt='d', cmap='Blues', ax=ax[1], cbar=False)
ax[1].xaxis.set_ticklabels(['Not converted', 'Converted'])
ax[1].yaxis.set_ticklabels(['Not converted', 'Converted'])
ax[1].set_title('RF without hyperparameters tuning', loc='left')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('True')

plt.tight_layout()
plt.show()

10.5 Submission¶

Class predictions in the left, and probabilities to convert into a customer on the right.

In [ ]:
lead_scoring = rf_randomcv.predict_proba(X_test_pp)[:,1]
lead_prediction = rf_rcv_pred
results = np.round(np.c_[lead_prediction,lead_scoring],2)

# Let's take a look of the first 10 rows
results[:10]

11. Conclusions

¶

In summary, our data science project focused on fine-tuning lead scoring for X Education. We aimed to exceed an 80% precision goal, which we not only met but exceeded. Throughout our journey, we identified key factors like phone interactions, referrals, and online engagement that strongly correlated with lead conversion, leading to actionable strategies.

One notable achievement was the development of an automated lead scoring algorithm that not only improved lead assessment precision but also streamlined operational efficiency. By targeting promising leads, X Education could reduce sales team costs significantly.

Our journey involved thorough data exploration, preprocessing, and model development, ensuring consistency and mitigating bias. We systematically evaluated models, with the tuned Random Forest model achieving an impressive F1 score of 0.9287 and a precision score of 0.9527 on the test dataset.

This data-driven journey provides X Education with actionable insights to enhance efficiency and revenue growth, positioning the company for a transformative phase.

If you've read until here, thank you. I hope you found this information helpful and interesting in some way. Your feedback is greatly appreciated. Best regards.